{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The ``inline`` flag will use the appropriate backend to make figures appear inline in the notebook.  \n",
    "%matplotlib inline\n",
    "\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "# `plt` is an alias for the `matplotlib.pyplot` module\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# import seaborn library (wrapper of matplotlib)\n",
    "import seaborn as sns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Matplotlib Wrappers (Pandas and Seaborn)\n",
    "\n",
    "Matplotlib is a very popular visualization library, but it definitely has flaws.\n",
    "\n",
    "1. Matplotlib defaults are not ideal (no grid lines, white background etc).\n",
    "2. The library is relatively low level. Doing anything complicated takes quite a bit of code. \n",
    "3. Lack of integration with pandas data structures (though this is being improved).\n",
    "\n",
    "In this video, we are going to make a more complicated visualization called a boxplot to show how helpful it is to work with the matplotlib wrappers pandas and seaborn."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### What is a boxplot\n",
    "![](images/boxplot.png)\n",
    "A boxplot is a standardized way of displaying the distribution of data based on a five number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). It can tell you about your outliers and what their values are. It can also tell you if your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed. If you want to learn more about how boxplots, you can learn more [here](https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load Data\n",
    "\n",
    "The data used to demonstrate boxplots is the Breast Cancer Wisconsin (Diagnostic) Data Set: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic). The goal of the visualization is to show how the distributions for the column `area_mean` differs for benign versus malignant `diagnosis`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load wisconsin breast cancer dataset\n",
    "# either benign or malignant\n",
    "cancer_df = pd.read_csv('data/wisconsinBreastCancer.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cancer_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Looking at the Distribution of the Dataset in terms of Diagnosis\n",
    "cancer_df['diagnosis'].value_counts(dropna = False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Plotting using Matplotlib"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "malignant = cancer_df.loc[cancer_df['diagnosis']=='M','area_mean'].values\n",
    "benign = cancer_df.loc[cancer_df['diagnosis']=='B','area_mean'].values\n",
    "\n",
    "plt.boxplot([malignant,benign], labels=['M', 'B']);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Plotting using Pandas\n",
    "Pandas can be used as a wrapper around Matplotlib. One reason why you might want to plot using Pandas is that it requires less code. \n",
    "\n",
    "We are going to create a boxplot to show how much less syntax you need to create the plot with pandas vs pure matplotlib. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Getting rid of area_mean \n",
    "cancer_df.boxplot(column = 'area_mean', by = 'diagnosis');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sometimes you will find it useful to use Matplotlib syntax to adjust the final plot output. The code below removes the suptitle and title using pure matplotlib syntax. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Same plot but without the area_mean subtitle and title\n",
    "cancer_df.boxplot(column = 'area_mean', by = 'diagnosis');\n",
    "plt.title('');\n",
    "plt.suptitle('');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Plotting using Seaborn\n",
    "Seaborn can be seen as a wrapper on top of Matplotlib. [Seaborn's website](https://seaborn.pydata.org/introduction.html) lists a bunch of advantages of using Seaborn including\n",
    "\n",
    "* Close integration with pandas data structures\n",
    "* Dataset oriented API for examining relationships between multiple variables. \n",
    "* Specialized support for using categorical variables to show observations or aggregate statistics. \n",
    "* Concise control over matplotlib figure styling with several built-in themes. \n",
    "* Tools for choosing color palettes that faithfully reveal patterns in your data. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import seaborn as sns\n",
    "\n",
    "sns.boxplot(x='diagnosis', y='area_mean', data=cancer_df)"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
